Let us go back to basics and model the association between a set of explanatory variables (covariates, independent variables, right-hand-side variables, \(X\)'s, etc.) and a response variable \(Y\) (dependent variable, left-hand-side variable, and so on) using the following linear relationship:
\[Y = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p + \epsilon,\] where \(\epsilon\) is a random variable typically assumed to be normally distributed. If \(Y\) is a continuous variable, we call this a regression problem.
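As a minimal sketch of the regression case, the snippet below simulates data from the linear model above (the coefficient values and noise level are made up for illustration) and recovers the \(\beta\)'s by ordinary least squares:

```python
import numpy as np

# Hypothetical synthetic data: Y = 2 + 3*X1 - 1*X2 + eps,
# with eps normally distributed as in the model above.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
beta_true = np.array([2.0, 3.0, -1.0])  # beta_0, beta_1, beta_2
eps = rng.normal(scale=0.5, size=n)
y = beta_true[0] + X @ beta_true[1:] + eps

# Ordinary least squares: add an intercept column, then solve
# the least-squares problem for (beta_0, beta_1, beta_2).
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # estimates close to (2, 3, -1)
```

With \(n = 500\) observations and modest noise, the estimated coefficients land close to the true values used in the simulation.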
If \(Y\) is a binary (dummy, zero-one) variable, we instead use logistic regression:
\[\textrm{logit}(P(Y = 1|X_1,\ldots,X_p)) = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p.\] We call this a classification problem.
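For the classification case, here is a minimal sketch that simulates binary outcomes from the logistic model above (one covariate; the coefficient values are made up for illustration) and fits the \(\beta\)'s by maximizing the Bernoulli log-likelihood with plain gradient ascent. Recall that the logit is the log-odds, so its inverse is the sigmoid function used below:

```python
import numpy as np

def sigmoid(z):
    # Inverse of the logit: maps log-odds to P(Y = 1 | X)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic data: logit(P(Y=1|X)) = -0.5 + 1.5*X
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, sigmoid(beta_true[0] + beta_true[1] * x))

# Maximum likelihood via gradient ascent on the log-likelihood;
# the gradient (score) is X'(y - p_hat).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    beta += lr * X.T @ (y - sigmoid(X @ beta)) / n

print(beta)  # estimates close to (-0.5, 1.5)
```

In practice one would use an off-the-shelf routine (e.g. iteratively reweighted least squares in a statistics package), but the gradient-ascent loop makes the likelihood machinery explicit.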
- In classical econometrics we are often interested in statistical and causal inference.
- In modern data science we are more interested in predictions, but this distinction should not be taken too far.
- We will focus on the classification problem in the examples today, because it is easier to visualize.